Task 2: Anomaly Detection¶
Introduction¶
The objective of this task is to identify unusual or suspicious vehicle offers within the dataset. Data exploration is conducted to detect potential anomalies, followed by the application of appropriate methods to identify outliers. The findings are clearly interpreted, and possible explanations for the detected anomalies are discussed to provide insights into the nature of the data irregularities.
Background¶
The information presented in this report is gathered from the following sources:
- Information outlined in the project requirements document;
- Details provided on Kaggle;
- Documentation and code traced back through GitHub commits.
Before diving into the analysis, it is essential to understand the nature of the data. This step is critical as it guides actions such as:
- Making reasonable assumptions about the data;
- Handling duplicated and missing values;
- Interpreting and understanding the results of this report.
Data¶
The dataset has the following characteristics:
Original source: The data comes from Otomoto.pl, a popular Polish car sales platform. It consists of self-reported information from individuals and agencies. Most fields are filled using dropdown menus, while numeric fields allow users to input their values. The platform also offers unstructured data, such as images and detailed car descriptions, though these are not included in the dataset.
Method of collection: The dataset was scraped from the Otomoto.pl website by a student from the Military University of Technology in Warsaw as part of their coursework. It represents a snapshot of the platform's data taken on December 4, 2021, roughly three years before this analysis.
Scope: The dataset includes 208,304 observations across 25 variables.
Timeline: The timeline based on the offer publication date ranges from March 26, 2021 to May 5, 2021.
1. Data Preparation and EDA¶
1.a. Duplicated and Missing Values¶
Before proceeding with any further work, it is essential to ensure that any duplicate values are removed from the dataset. In a business context, this refers to ads that contain identical information. These duplicates typically arise when the website logic fails to filter out identical listings. Below are some key statistics related to this matter:
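The duplicate check can be sketched as follows; this is a minimal illustration on a toy frame, assuming the listings are loaded into a pandas DataFrame (column names here are illustrative, not the dataset's actual schema):

```python
import pandas as pd

# Toy frame standing in for the scraped listings; real columns differ.
df = pd.DataFrame({
    "Price": [10_000, 10_000, 25_000],
    "Mileage_km": [120_000, 120_000, 45_000],
    "Condition": ["Used", "Used", "New"],
})

# Rows identical to an earlier row across all columns count as duplicates.
n_duplicates = df.duplicated().sum()
df = df.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(df))
```

`duplicated()` flags only the later copies, so the first occurrence of each listing survives `drop_duplicates()`.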
Next, we will evaluate the completeness of the data by measuring the number of non-empty values for each variable. I have categorized the variables into three groups as follows:
- Green: Fully usable variables.
- Yellow: Variables with an acceptable level of completeness, where it is reasonable to remove NAs and proceed.
- Red: Variables with an unacceptable level of completeness, requiring removal.
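The traffic-light grouping above can be sketched as follows; the thresholds (100% for green, 70% for yellow) are illustrative assumptions, not the exact cut-offs used in this report:

```python
import pandas as pd

# Toy frame with one fully complete, one partial, and one sparse column.
df = pd.DataFrame({
    "Price": [1, 2, 3, 4],
    "Drive": ["4x4", None, "FWD", "RWD"],
    "CO2_emissions": [None, None, None, 120],
})

# Share of non-empty values per column.
completeness = df.notna().mean()

def bucket(share):
    # Hypothetical thresholds for the green/yellow/red grouping.
    if share == 1.0:
        return "green"
    if share >= 0.7:
        return "yellow"
    return "red"

groups = completeness.map(bucket)
print(groups.to_dict())
```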
1.b. Pre-Processing and Feature Engineering¶
At this stage, I will categorize the variables into two groups: numeric and categorical. Based on the variable type, I will perform the following actions:
- Numeric variables: Pre-process and proceed with all meaningful variables.
- Categorical variables: Identify variables with a low number of levels and either apply one-hot encoding or transform them into a numeric format. As you may have noticed, I am placing significant emphasis on converting all variables into numeric format, as this is a requirement for certain dimensionality reduction and clustering methods.
For the categorical variables mentioned above, we will focus on those with a lower number of levels or classes: Drive, Condition, and Transmission. Whether each is suitable for conversion into dummy variables depends on how many levels it contains:
- Condition and Transmission: Have few levels and are suitable for one-hot encoding.
- Drive: Contains too many levels for straightforward one-hot encoding. I may consider combining some levels into a new category called 4x4 (all), but first, I will examine the behavior within the existing classes.
Additionally, some categorical variables can be converted into numeric format for improved usability and comparability:
- Features: Transform into a numeric variable, Number_of_features.
- Offer_publication_date: Convert into Days_on_market.
For numeric variables, a few quick transformations offer immediate gains in interpretability. I am applying the following:
- Price: Using the Currency column, convert to Price_in_CAD for improved interpretability.
- Production_year: Transform into Vehicle_age for easier interpretation.
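The feature-engineering steps above can be sketched together as follows. The feature separator, the PLN-to-CAD rate, and the use of the December 4, 2021 snapshot as the reference date are all assumptions made for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Features": ["ABS;Airbag;GPS", "ABS"],
    "Offer_publication_date": ["2021-03-26", "2021-05-01"],
    "Production_year": [2015, 2020],
    "Price": [40_000, 90_000],
    "Currency": ["PLN", "PLN"],
    "Transmission": ["Manual", "Automatic"],
})

# Count of listed features (the ";" separator is an assumption).
df["Number_of_features"] = df["Features"].str.split(";").str.len()

# Days between publication and the snapshot date.
snapshot = pd.Timestamp("2021-12-04")
df["Days_on_market"] = (
    snapshot - pd.to_datetime(df["Offer_publication_date"])
).dt.days

# Vehicle age relative to the snapshot year.
df["Vehicle_age"] = snapshot.year - df["Production_year"]

# Currency conversion using a hypothetical PLN->CAD rate.
rates_to_cad = {"PLN": 0.31}
df["Price_in_CAD"] = df["Price"] * df["Currency"].map(rates_to_cad)

# One-hot encode a low-cardinality categorical variable.
df = pd.get_dummies(df, columns=["Transmission"], prefix="Transmission")
```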
Now that we have created all the features and selected the in-scope variables, let's examine two plots:
- Correlation Matrix: This will allow us to visually assess the relationships between variables.
- Distribution Plots: This helps identify outliers that may introduce unnecessary noise, potentially obscuring important signals.
It is important to note that the data is self-reported, which means we may question the reasonableness of certain values. The variables outlined below have been capped to remove extremely high values that could introduce noise into the data. Some high values were retained as long as the distribution appeared to follow a continuous scale. For capping purposes, the following limits were applied:
- Mileage_km: Capped at 1,000,000.
- Doors_number: Capped at 6.
- Price_in_CAD: Capped at 1,000,000.
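Applying the caps can be sketched as below, interpreting "capped to remove" as dropping rows above the limits (an alternative reading would be winsorizing with `.clip()`; the data here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Mileage_km": [150_000, 2_500_000],
    "Doors_number": [4, 11],
    "Price_in_CAD": [30_000, 5_000_000],
})

# Caps taken from the list above; rows exceeding any cap are removed.
caps = {"Mileage_km": 1_000_000, "Doors_number": 6, "Price_in_CAD": 1_000_000}
mask = pd.Series(True, index=df.index)
for col, cap in caps.items():
    mask &= df[col] <= cap
df = df[mask].reset_index(drop=True)
print(len(df))
```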
2. Dimensionality Reduction¶
Principal Component Analysis (PCA) is a widely used technique for reducing the dimensionality of datasets. It enhances interpretability while minimizing information loss. The process involves two key steps:
Identifying the Number of Principal Components: This step determines how many components are needed to explain a significant portion of the variance in the data. The most common approach is to examine a scree plot or cumulative variance plot to identify the optimal number of components.
Performing PCA: After determining the appropriate number of components, PCA is applied to transform the data into the new set of components.
The scree plot above illustrates how the explained variance changes as the number of principal components increases. To determine the optimal number of components, we aim for a cumulative explained variance between 80% and 90%. This decision is somewhat subjective; in this case, 80%–90% cumulative explained variance corresponds to a range of 5 to 7 principal components. I am comfortable selecting 5 principal components.
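The component-selection logic can be sketched as follows, using a random matrix in place of the real feature matrix; the 80% threshold matches the lower end of the range discussed above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))  # stand-in for the numeric feature matrix

# Standardize so that no single variable dominates the components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching 80% cumulative explained variance.
n_components = int(np.argmax(cumvar >= 0.80)) + 1
X_reduced = PCA(n_components=n_components).fit_transform(X_scaled)
```

On the real, correlated data the elbow arrives much earlier than on this random stand-in, which is why 5 components suffice there.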
Using PCA not only improves the efficiency of clustering techniques by reducing computation time but also helps mitigate random noise in the data and makes clusters more distinguishable.
3. Anomaly Detection¶
At this stage, I am applying a clustering technique called DBSCAN. In the preceding sections, I prepared the data by reducing its dimensionality and ensuring that only numeric values are supplied. One useful characteristic of DBSCAN is that it assigns anomalies to a separate noise label rather than forcing them into a cluster.
DBSCAN Clustering¶
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups points based on density, assigning points in low-density areas as noise. There are two key parameters that need to be tuned:
- Eps: The maximum distance between two points for them to be considered neighbors.
- Min_samples: The minimum number of points required to form a dense cluster.
Similar to the elbow method used for k-means clustering, we can estimate the Eps value from the k-distance graph. For Min_samples, the general recommendation is to consider the dimensionality of the data and choose a value around two times the number of dimensions. Based on these steps, the recommended values are:
- Eps: 0.5
- Min_Samples: 30
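The Eps estimation via the k-distance graph can be sketched as follows. The data matrix is synthetic and stands in for the five principal components, and the 2x-dimensionality rule is applied directly (the values actually chosen above were tuned to the real data):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))  # stand-in for the 5 principal components

min_samples = 2 * X.shape[1]   # rule of thumb: ~2x dimensionality
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# Distance to each point's k-th neighbor (index 0 is the point itself),
# sorted ascending; the "elbow" of this curve suggests a value for Eps.
k_distances = np.sort(distances[:, -1])
```

Plotting `k_distances` and reading off the bend of the curve gives the Eps candidate.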
The DBSCAN clustering results yield a Silhouette Score of 0.320, which indicates moderate clustering quality.
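The clustering and scoring step can be sketched as below, using synthetic dense blobs plus sparse noise in place of the PCA-reduced data (all data here is illustrative, so the resulting score will not match the 0.320 reported above):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two dense blobs plus sparse uniform noise, standing in for the real data.
blob1 = rng.normal(loc=0.0, scale=0.2, size=(150, 5))
blob2 = rng.normal(loc=3.0, scale=0.2, size=(150, 5))
noise = rng.uniform(low=-2, high=5, size=(20, 5))
X = np.vstack([blob1, blob2, noise])

labels = DBSCAN(eps=0.5, min_samples=30).fit_predict(X)
outlier_share = np.mean(labels == -1)  # DBSCAN marks noise with label -1

# Silhouette computed on the clustered (non-noise) points only.
clustered = labels != -1
score = (silhouette_score(X[clustered], labels[clustered])
         if len(set(labels[clustered])) > 1 else None)
```

The `-1` label is what makes DBSCAN convenient for this task: the anomalies fall out of the fit directly, with no separate scoring step.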
From the scatter plot:
- The red points, labeled as the outlier cluster, represent noise or outliers effectively identified by DBSCAN; this is exactly what we are looking for.
4. Anomaly Analysis and Interpretation¶
Let's review some of the assumptions of the method mentioned earlier to determine if any are being violated:
DBSCAN:
- Dense clusters: Groups points based on high-density regions rather than shape.
- Noise exists: Capable of identifying outliers or points that do not belong to any cluster.
- No fixed cluster size: Can accommodate clusters of varying sizes and shapes.
- Varying density: Works well with clusters that have different densities.
Based on the following plots, approximately 7% of the data is classified as outliers. Let’s now compare the behavior of categorical and numeric variables for both outliers and non-outliers.
From the categorical in-scope variables, we observe the following:
- Condition: There doesn’t appear to be a relationship between vehicle condition and outliers.
- Transmission: Outliers occur more frequently in vehicles with automatic transmission when considering the ratios.
Based on the numeric values, we can make the following observations:
- Price: Price has been a major driver of PCA, explaining a significant portion of the variability in the data. Vehicles with substantially higher prices are classified as outliers.
- Power (HP): There is a clear trend where outliers tend to have higher horsepower.
5. Conclusions¶
In this report, I’ve explored a clustering approach to identify outliers in the data. Although DBSCAN does not show great performance based on the silhouette score, one of its benefits is its ability to create clusters and isolate outliers that do not belong to any cluster. As a result of my analysis, I can highlight the following characteristics of the outliers:
Price: Price has been a major driver of PCA, explaining a significant portion of the variability in the data. Vehicles with substantially higher prices are classified as outliers. This suggests that unusually high-priced vehicles may not fit typical market patterns and should be further examined.
Power (HP): There is a clear trend where outliers tend to have higher horsepower. This indicates that vehicles with extremely high power values might be flagged as outliers and warrant closer inspection.
Condition: There doesn’t appear to be a strong relationship between vehicle condition and outliers. The condition of the vehicle seems to have minimal impact on whether it is classified as an outlier, suggesting that other factors might be more significant in identifying outliers.
Transmission: Outliers occur more frequently in vehicles with automatic transmission when considering the ratios. This indicates that automatic transmission vehicles may be more prone to being classified as outliers, possibly due to pricing or other factors associated with these models.
The website may consider these characteristics when identifying suspicious listings and taking action to remove them.